Feature Selection and Customer Segmentation¶
Introduction¶
In this project, the goal is to analyse the data and ultimately build a customer segmentation model. The data contains a binary variable, converted, indicating whether a customer converted or not, which plays a crucial role in the clustering: the aim is to find clusters that differ in their conversion rates.
Objective¶
The primary objective of this work is to identify the key features that significantly contribute to predicting the converted variable. Subsequently, these features will be leveraged to segment customers effectively.
Approach¶
Feature Selection:
- Utilize several techniques such as Recursive Feature Elimination, feature importance from machine learning models, and domain knowledge to identify crucial features.
Customer Segmentation:
- Employ clustering algorithms such as K-means to group customers based on the selected features.
Model Evaluation:
- Assess the performance of the Customer Segmentation model.
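The three stages above can be sketched end to end on synthetic data. This is a minimal illustration with scikit-learn, not the notebook's actual dataset or parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer data
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=42)

# 1) Feature selection: keep the 4 features most predictive of the target
rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=4)
X_sel = rfe.fit_transform(X, y)

# 2) Segmentation: cluster on the selected features
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_sel)

# 3) Evaluation: silhouette score of the resulting clusters
print(round(silhouette_score(X_sel, labels), 3))
```

The same pattern, fitted on the real features and the converted target, is what the rest of the notebook develops step by step.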
import sys
sys.executable
'/home/roma/LENUS_TASK/Customer-Seg-Study/env/bin/python'
import os
import pandas as pd
import numpy as np
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from umap import UMAP
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from tqdm import tqdm
import plotly.graph_objects as go
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score, adjusted_rand_score, homogeneity_score, completeness_score, v_measure_score
# Get the current directory
current_directory = Path.cwd()
!ls data
customer_data_sample.csv test
# The relative path to CSV file
csv_file_path = current_directory / "data" / "customer_data_sample.csv"
df = pd.read_csv(csv_file_path)
df.shape
(891, 10)
df.head()
| customer_id | converted | customer_segment | gender | age | related_customers | family_size | initial_fee_level | credit_account_id | branch | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15001 | 0 | 13 | male | 22.0 | 1 | 0 | 14.5000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki |
| 1 | 15002 | 1 | 11 | female | 38.0 | 1 | 0 | 142.5666 | afa2dc179e46e8456ffff9016f91396e9c6adf1fe20d17... | Tampere |
| 2 | 15003 | 1 | 13 | female | 26.0 | 0 | 0 | 15.8500 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki |
| 3 | 15004 | 1 | 11 | female | 35.0 | 1 | 0 | 106.2000 | abefcf257b5d2ff2816a68ec7c84ec8c11e0e0dc4f3425... | Helsinki |
| 4 | 15005 | 0 | 13 | male | 35.0 | 0 | 0 | 16.1000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   customer_id        891 non-null    int64
 1   converted          891 non-null    int64
 2   customer_segment   891 non-null    int64
 3   gender             891 non-null    object
 4   age                714 non-null    float64
 5   related_customers  891 non-null    int64
 6   family_size        891 non-null    int64
 7   initial_fee_level  891 non-null    float64
 8   credit_account_id  891 non-null    object
 9   branch             889 non-null    object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB
1. Feature Meanings and Descriptions¶
| Field | Explanation |
|---|---|
| customer_id | Numeric id for a customer |
| converted | Whether a customer converted to the product (1) or not (0) |
| customer_segment | Numeric id of the customer segment the customer belongs to |
| gender | Customer gender |
| age | Customer age |
| related_customers | Numeric - number of people related to the customer |
| family_size | Numeric - number of family members |
| initial_fee_level | Initial service fee level the customer is enrolled in |
| credit_account_id | Identifier (hash) for the customer's credit account. If a customer has none, it is shown as "9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0" |
| branch | Which branch the customer is mainly associated with |
Initial guesses before any analysis
To understand the data, we can examine each variable and reason about its meaning. Since the goal of this task is to identify the most important features for predicting customer conversion, I would like to record some initial guesses before any analysis, based solely on common sense.
- Variable - age.
- Type - Demographic
- Explanation - Since the final end-user product of Lenus is not a traditional product, some generations would likely appreciate it more than others.
- Variable - initial_fee_level.
- Type - Segment-related
- Explanation - This variable looks closely tied to the customer segments and is likely to be important.
- Variable - customer_segment.
- Type - Derived
- Explanation - customer_segment appears to be a feature derived from the data.
# Splitting the data into train and test sets
df, test = train_test_split(df, test_size=0.2, random_state=1)
test_dir_path = current_directory / "data" / "test"
if not os.path.exists(test_dir_path):
    os.makedirs(test_dir_path)
test_file_path = current_directory / "data" / "test" / "test.csv"
test.to_csv(test_file_path, index=False)  # avoid writing the index as an extra column
1.1 Exploring features¶
numerical_feats = ['age', 'related_customers', 'family_size', 'initial_fee_level']
categorical_feats = ['customer_segment', 'gender', 'branch']
target = 'converted'
counts = df['converted'].value_counts()
# Create a bar plot with specified colors
fig = px.bar(x=counts.index, y=counts.values, color=counts.index,
labels={'x': 'Target', 'y': 'Count'}, title='Distribution of Target Variable')
fig.update_layout(xaxis_type='category', title_x=0.5) # Centering the title
fig.show()
Age
# Create two subplots for each converted category
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot for converted == 0
sns.histplot(df[df['converted'] == 0]['age'], kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Converted = 0')
# Plot for converted == 1
sns.histplot(df[df['converted'] == 1]['age'], kde=True, ax=axes[1], color='salmon')
axes[1].set_title('Converted = 1')
plt.tight_layout()
plt.show()
There appear to be customers aged 0 and 1. Are these infants, or data errors?
df['age'].value_counts()
age
24.00 23
30.00 22
19.00 22
22.00 20
18.00 19
..
0.42 1
66.00 1
0.67 1
20.50 1
0.75 1
Name: count, Length: 86, dtype: int64
There are even fractional ages such as 0.42 and 0.83, which presumably encode infants' ages in years.
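Fractional ages below 1 can be isolated directly; a quick sketch with illustrative values:

```python
import pandas as pd

# Hypothetical sample of the age column, including fractional infant ages
ages = pd.Series([24.0, 30.0, 0.42, 0.67, 0.83, 19.0])
likely_infants = ages[ages < 1]
print(len(likely_infants))  # → 3
```

Running the same filter on df['age'] would show how many such records exist before deciding how to treat them.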
Initial fee level
var = 'initial_fee_level'
data = pd.concat([df.converted, df[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=target, y=var, data=data, hue='converted')
fig.axis(ymin=0, ymax=300)
(-0.5, 1.5, 0.0, 300.0)
df[df['converted']==1]['family_size'].value_counts()
family_size 0 178 1 57 2 31 3 2 5 1 Name: count, dtype: int64
df['branch'].value_counts()
branch Helsinki 513 Tampere 133 Turku 64 Name: count, dtype: int64
Up to this point we have relied on intuition alone, which is subjective. Let's quantify the relationships with a correlation matrix.
corr_feats = ['converted', 'customer_segment', 'age', 'related_customers', 'family_size', 'initial_fee_level']
#correlation matrix
corrmat = df[corr_feats].corr()
f, ax = plt.subplots(figsize=(6, 7))
sns.set(font_scale=1.25)
sns.heatmap(corrmat, vmax=.8, square=True, cbar=True, annot=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=corr_feats, xticklabels=corr_feats);
The features most correlated with the target are customer_segment and initial_fee_level, but they are also correlated with each other.
counts = df.groupby(['customer_segment', 'converted']).size().reset_index(name='count')
sns.kdeplot(x=df.customer_segment, y=df.converted, cmap="Blues", fill=True, bw_adjust=0.5)
plt.yticks([0, 1])
plt.xticks([11, 12, 13])
# Add annotations
for index, row in counts.iterrows():
    plt.text(row['customer_segment'], row['converted'], row['count'], color='black', ha='center')
plt.show()
2. Missing Data¶
df.isnull().sum()
customer_id 0 converted 0 customer_segment 0 gender 0 age 144 related_customers 0 family_size 0 initial_fee_level 0 credit_account_id 0 branch 2 dtype: int64
df.shape
(712, 10)
Two features have missing values: age and branch.
df.groupby(['branch', 'converted']).size().reset_index(name='count')
| branch | converted | count | |
|---|---|---|---|
| 0 | Helsinki | 0 | 344 |
| 1 | Helsinki | 1 | 169 |
| 2 | Tampere | 0 | 60 |
| 3 | Tampere | 1 | 73 |
| 4 | Turku | 0 | 39 |
| 5 | Turku | 1 | 25 |
# Definition of imputation strategies for each column
imputation_strategies = {
'age': 'mean',
'branch': 'most_frequent'
}
age_imputer = SimpleImputer(missing_values=np.nan, strategy=imputation_strategies['age'])
branch_imputer = SimpleImputer(missing_values=np.nan, strategy=imputation_strategies['branch'])
preprocessor = ColumnTransformer(
transformers=[
('age_imp', age_imputer, ['age']),
('branch_imp', branch_imputer, ['branch'])
])
df.loc[:, ['age', 'branch']] = preprocessor.fit_transform(df)
df.isnull().sum()
customer_id 0 converted 0 customer_segment 0 gender 0 age 0 related_customers 0 family_size 0 initial_fee_level 0 credit_account_id 0 branch 0 dtype: int64
3. Outliers treatment, Univariate approach¶
df.head()
| customer_id | converted | customer_segment | gender | age | related_customers | family_size | initial_fee_level | credit_account_id | branch | |
|---|---|---|---|---|---|---|---|---|---|---|
| 301 | 15302 | 1 | 13 | male | 30.166232 | 2 | 0 | 46.5000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Turku |
| 309 | 15310 | 1 | 11 | female | 30.000000 | 0 | 0 | 113.8584 | e70ba215a23e2c438f86bc8ddf119c579b7bff180841c6... | Tampere |
| 516 | 15517 | 1 | 12 | female | 34.000000 | 0 | 0 | 21.0000 | 16ee13fe0dd987f3ef966e930adebd1e4f5d40f6180ac7... | Helsinki |
| 120 | 15121 | 0 | 12 | male | 21.000000 | 2 | 0 | 147.0000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki |
| 570 | 15571 | 1 | 12 | male | 62.000000 | 0 | 0 | 21.0000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki |
# Defining the interquartile range for age
q1 = df.age.quantile(0.25)
q3 = df.age.quantile(0.75)
iqr = q3 - q1
# Deliberately wide upper fence, so that older customers are not flagged
up = q3 + 100 * iqr
low = q1 - 1.5 * iqr
df[(df.age < low) | (df.age > up)].shape
(6, 10)
Since I don't see much potential in the age variable, I don't want to remove many data points, and at the same time I want to keep age in the feature set.
sns.set_style("whitegrid")
# Create the plot
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='initial_fee_level', bins=40, kde=True, color='lightblue', edgecolor='black')
# Add labels and title
plt.xlabel('initial_fee_level', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('initial_fee_level Distribution', fontsize=16)
# Add grid
plt.grid(True, linestyle='--', alpha=0.7)
# Show plot
plt.show()
The distribution is far from normal and heavily right-skewed, so it can be capped from the top.
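An alternative to dropping rows would be capping (winsorizing) the right tail with np.clip; a sketch with illustrative fee values and a hypothetical cutoff of 400:

```python
import numpy as np

# Illustrative fee values; the last one exceeds the cutoff
fees = np.array([14.5, 21.0, 142.6, 106.2, 512.3])
capped = np.clip(fees, None, 400)  # cap the right tail at 400
print(capped.max())  # → 400.0
```

Capping keeps every row while limiting the influence of extreme values; the notebook instead removes the rows above the cutoff, which is also defensible when they are few.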
sns.boxplot(df["initial_fee_level"])
<Axes: ylabel='initial_fee_level'>
# Fixed bounds for initial_fee_level (not IQR-based; chosen from the boxplot)
low = 0
up = 400
df[(df.initial_fee_level < low) | (df.initial_fee_level > up)].shape
(17, 10)
I want to remove around 2% of the data, i.e. ~17 rows.
df = df[df.initial_fee_level <= up]
df.shape
(695, 10)
4. Create new features¶
no_credit_account_value = '9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0'
df.loc[:, 'has_credit_account'] = df['credit_account_id'].apply(lambda x: 0 if x==no_credit_account_value else 1)
fig = px.histogram(df, x=df['has_credit_account'], color=df['has_credit_account'])
fig.show()
5. Feature encoding¶
df.head()
| customer_id | converted | customer_segment | gender | age | related_customers | family_size | initial_fee_level | credit_account_id | branch | has_credit_account | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 301 | 15302 | 1 | 13 | male | 30.166232 | 2 | 0 | 46.5000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Turku | 0 |
| 309 | 15310 | 1 | 11 | female | 30.000000 | 0 | 0 | 113.8584 | e70ba215a23e2c438f86bc8ddf119c579b7bff180841c6... | Tampere | 1 |
| 516 | 15517 | 1 | 12 | female | 34.000000 | 0 | 0 | 21.0000 | 16ee13fe0dd987f3ef966e930adebd1e4f5d40f6180ac7... | Helsinki | 1 |
| 120 | 15121 | 0 | 12 | male | 21.000000 | 2 | 0 | 147.0000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki | 0 |
| 570 | 15571 | 1 | 12 | male | 62.000000 | 0 | 0 | 21.0000 | 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... | Helsinki | 0 |
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Fit LabelEncoder
label_encoder.fit(df['gender'])
LabelEncoder()
# Transform
df.loc[:, 'gender'] = label_encoder.transform(df.gender)
# Label-encode the 'branch' column, ordered by city population (equivalently, by value counts)
# Define custom order
custom_order = {'Turku': 0, 'Tampere': 1, 'Helsinki': 2}
df['branch'].value_counts()
branch Helsinki 507 Tampere 124 Turku 64 Name: count, dtype: int64
df.loc[:, 'branch'] = df['branch'].map(custom_order)
6. Feature selection¶
By carefully selecting important features, we enhance the accuracy of our Customer Segmentation model, ensuring meaningful and actionable insights for targeted business strategies.
modeling_feats = ['customer_segment', 'gender', 'age', 'related_customers', 'family_size',
'initial_fee_level', 'branch', 'has_credit_account']
6.1 Scaling¶
scaler = StandardScaler()
scaler.fit(df[modeling_feats])
print('variables mean values: \n' + 90*'-' + '\n' , scaler.mean_)
scaled_matrix = scaler.transform(df[modeling_feats])
variables mean values: ------------------------------------------------------------------------------------------ [12.3323741 0.65611511 30.14104317 0.48776978 0.35251799 53.09461439 1.63741007 0.21582734]
scaled_feats = ['scaled_'+feat for feat in modeling_feats]
df.loc[:, scaled_feats] = scaled_matrix
# Define transformation function for test data
test = pd.read_csv(test_file_path)
def preprocess_data(data,
                    imputation_strategies,
                    preprocessor,
                    label_encoder,
                    scaler):
    # missing values: use the imputers already fitted on the train set
    data.loc[:, ['age', 'branch']] = preprocessor.transform(data)
    # new feature
    data.loc[:, 'has_credit_account'] = data['credit_account_id'].apply(lambda x: 0 if x == no_credit_account_value else 1)
    # label encoding
    data.loc[:, 'gender'] = label_encoder.transform(data.gender)
    data.loc[:, 'branch'] = data['branch'].map(custom_order)
    # scale with the scaler fitted on the train set
    scaled_matrix = scaler.transform(data[modeling_feats])
    data.loc[:, scaled_feats] = scaled_matrix
    return data
test = preprocess_data(test, imputation_strategies, preprocessor, label_encoder, scaler)
PCA analysis - it is always important to check how many principal components explain 80% of the variance.
pca = PCA()
pca.fit(scaled_matrix)
pca_samples = pca.transform(scaled_matrix)
scaled_matrix.shape
(695, 8)
# Get the explained variance ratio for each principal component
explained_variance_ratio = pca.explained_variance_ratio_
# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Generate the plot
plt.figure(figsize=(10, 6))
# Plot explained variance ratio with customized colors
bars = plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, align='center', color='skyblue', label='Explained Variance Ratio')
# Plot cumulative explained variance with a different color
plt.plot(range(1, len(explained_variance_ratio) + 1), cumulative_explained_variance, color='orange', marker='o', linestyle='-', linewidth=2, label='Cumulative Explained Variance')
# Add values on top of each bar
for i, bar in enumerate(bars):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), f'{explained_variance_ratio[i]:.2f}', ha='center', va='bottom')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')
plt.title('Explained Variance and Cumulative Explained Variance by Principal Component')
plt.xticks(range(1, len(explained_variance_ratio) + 1))
plt.legend()
plt.grid(True)
# Customize background color
plt.gca().set_facecolor('lightgrey')
plt.show()
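The cumulative curve above can also be read off numerically. A sketch with hypothetical explained-variance ratios (not the values from this dataset):

```python
import numpy as np

# Hypothetical explained-variance ratios for 8 components
evr = np.array([0.30, 0.22, 0.15, 0.12, 0.09, 0.06, 0.04, 0.02])
cum = np.cumsum(evr)
# Index of the first component where the cumulative variance reaches 80%
n_components_80 = int(np.argmax(cum >= 0.80)) + 1
print(n_components_80)  # → 5
```

Applying the same two lines to pca.explained_variance_ratio_ gives the number of components to retain without reading it off the plot.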
UMAP dimensionality reduction plot before feature selection, with all features included.
features = scaled_matrix
umap_2d = UMAP(n_components=2, init='random', random_state=0)
umap_3d = UMAP(n_components=3, init='random', random_state=0)
proj_2d = umap_2d.fit_transform(features)
proj_3d = umap_3d.fit_transform(features)
fig_2d = px.scatter(
proj_2d, x=0, y=1,
color=df.converted, labels={'color': 'converted'}
)
fig_3d = px.scatter_3d(
proj_3d, x=0, y=1, z=2,
color=df.converted, labels={'color': 'converted'}
)
fig_3d.update_traces(marker_size=5)
fig_2d.show()
fig_3d.show()
6.2 RFE: Recursive Feature Elimination¶
Overview:
RFE (Recursive Feature Elimination) is a technique used for selecting the most important features from a dataset. It operates by iteratively training a model, ranking features by their importance, and eliminating the least important ones until the desired number is reached. This iterative process enhances model efficiency and interpretability.
Key Steps:
- Model Training: Train a model on the entire feature set.
- Feature Ranking: Rank features based on their importance scores.
- Feature Elimination: Recursively eliminate the least important features.
- Stopping Criterion: Stop when the desired number of features is reached or a predetermined criterion is met.
RFE helps streamline the feature selection process, leading to improved model performance and easier interpretation of results.
# Initialize the estimator
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
# Perform Recursive Feature Elimination (RFE)
# With n_features_to_select unset, RFE keeps half of the features (4 of 8 here)
rfe = RFE(estimator, step=1)
rfe.fit(scaled_matrix, df.converted)
RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=42))
# Selected features after RFE
selected_features = rfe.support_
print("Selected Features:\n")
for feature, selected in zip(modeling_feats, selected_features):
    print(f"{feature}: {'Selected' if selected else 'Not Selected'}")
Selected Features: customer_segment: Selected gender: Selected age: Selected related_customers: Not Selected family_size: Not Selected initial_fee_level: Selected branch: Not Selected has_credit_account: Not Selected
# Feature ranking after RFE (ranking of features by importance)
feature_ranking = rfe.ranking_
print("\nFeature Rankings:\n")
for feature, rank in zip(modeling_feats, feature_ranking):
    print(f"{feature}: {rank}")
Feature Rankings: customer_segment: 1 gender: 1 age: 1 related_customers: 3 family_size: 4 initial_fee_level: 1 branch: 5 has_credit_account: 2
# Obtain feature importances from the final estimator (Random Forest)
# Note: rfe.estimator_ is fitted on the selected features only,
# so importances must be paired with the selected feature names
selected_names = [feat for feat, sel in zip(modeling_feats, rfe.support_) if sel]
feature_importances = rfe.estimator_.feature_importances_
print("\nFeature Importances:\n")
for feature, importance in zip(selected_names, feature_importances):
    print(f"{feature}: {importance}")
Feature Importances: customer_segment: 0.10554244028154942 gender: 0.26695115741618053 age: 0.30036380140865826 initial_fee_level: 0.3271426008936118
6.3 Exploring Multiple Models for Feature Importance¶
Overview:
In this approach, I explored multiple models to determine feature importance: Random Forest and Logistic Regression (an SVM is set up as well but commented out). Using grid search, I identified the best hyperparameters for each model and then averaged the feature importances across the best models, obtaining a consolidated view of feature importance derived from distinct model types.
Key Steps:
Model Selection:
- Chose two different model families: Random Forest and Logistic Regression (with SVM available but commented out).
Grid Search:
- Performed grid search to find the best hyperparameters for each model.
Feature Importance Calculation:
- Calculated the average feature importances from the best models of each type.
This approach provides a comprehensive understanding of feature importance, leveraging insights from multiple modeling techniques.
X, y = scaled_matrix, df.converted
# Define models
models = {
#'SVM': SVC(),
'Random Forest': RandomForestClassifier(),
'Linear Classifier': LogisticRegression(max_iter=1000)
}
# Define parameter grids for tuning
param_grids = {
#'SVM': {'C': [0.1, 1, 10], 'gamma': [0.1, 0.01], 'kernel': ['linear', 'rbf']},
'Random Forest': {'n_estimators': [50, 200], 'max_depth': [3, 5, 10]},
'Linear Classifier': {'C': [0.1, 1, 10]}
}
# Dictionary to store best parameters
best_params = {}
# Dictionary to store feature importances
feature_importances = {}
# Loop over models, with a progress bar over the model list
for name, model in tqdm(models.items(), desc="Model grid search"):
    print(f"Training {name}...")
    # GridSearchCV explores the whole parameter grid with cross-validation,
    # so a single call per model is enough
    clf = GridSearchCV(model, param_grids[name], cv=5)
    clf.fit(X, y)
    best_params[name] = clf.best_params_
    # best_estimator_ is already refit on the full data (refit=True by default)
    best_model = clf.best_estimator_
    # Extract feature importances (absolute coefficients for the linear model)
    if name == 'Linear Classifier':
        importance = np.abs(best_model.coef_[0])
    else:
        importance = best_model.feature_importances_
    feature_importances[name] = importance
    print(f"{name} best params: {clf.best_params_}")
Training Random Forest...
Random Forest - Parameter Grid Search: 100%|█| 2/2 [00:12<00:00, 6.04s/it, Best
Training Linear Classifier...
Linear Classifier - Parameter Grid Search: 100%|█| 1/1 [00:00<00:00, 12.87it/s,
# Averaging feature importances
avg_importance = np.mean([value for value in feature_importances.values()], axis=0)
# Create dictionary with averaged feature importances
avg_feature_importances = dict(zip(modeling_feats, avg_importance))  # keyed by feature name
fig = go.Figure(go.Bar(
x=modeling_feats,
y=avg_importance,
marker=dict(color='rgb(26, 118, 255)')
))
# Update layout for better visualization
fig.update_layout(
title='Feature Importances',
xaxis=dict(title='Features', tickangle=45),
yaxis=dict(title='Importance'),
template='plotly_white',
height=600 # Adjust the figure height as needed
)
# Show the plot
fig.show()
counts = df.groupby(['gender', 'converted']).size().reset_index(name='count')
sns.kdeplot(x=df.gender, y=df.converted, cmap="Blues", fill=True, bw_adjust=0.5)
plt.yticks([0, 1])
plt.xticks([0, 1])
# Add annotations
for index, row in counts.iterrows():
    plt.text(row['gender'], row['converted'], row['count'], color='black', ha='center')
plt.show()
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
label_mapping
{'female': 0, 'male': 1}
clustering_feats = ['scaled_customer_segment', 'scaled_gender', 'scaled_age', 'scaled_related_customers',
'scaled_initial_fee_level', 'scaled_has_credit_account']
clustering_feats_indices = [True if feat in clustering_feats else False for feat in df.columns]
7. Clustering algorithm¶
X_train, X_test = df.loc[:, clustering_feats], test.loc[:, clustering_feats]
Choosing the optimal K
# Create the K-means model for different values of K and collect inertias
def try_different_clusters(K, data):
    cluster_values = list(range(1, K + 1))
    inertias = []
    for c in cluster_values:
        model = KMeans(n_clusters=c, init='k-means++', max_iter=400, random_state=42)
        model.fit(data)
        inertias.append(model.inertia_)
    return inertias
# Find output for k values between 1 to 12
outputs = try_different_clusters(12, X_train)
distances = pd.DataFrame({"clusters": list(range(1, 13)),"sum of squared distances": outputs})
# Finding optimal number of clusters k
figure = go.Figure()
figure.add_trace(go.Scatter(x=distances["clusters"], y=distances["sum of squared distances"]))
figure.update_layout(xaxis = dict(tick0 = 1,dtick = 1,tickmode = 'linear'),
xaxis_title="Number of clusters",
yaxis_title="Sum of squared distances",
title_text="Finding optimal number of clusters using elbow method")
figure.show()
By the elbow rule, the optimal K appears to be 5.
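The elbow choice can be cross-checked with the silhouette score. A sketch on synthetic blob data standing in for the scaled features (illustrative, not the notebook's dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 5 well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=42)

# Silhouette score for each candidate K
scores = {k: silhouette_score(X_demo, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo))
          for k in range(2, 9)}
best_k = max(scores, key=scores.get)
print(best_k)
```

Unlike inertia, the silhouette score does not decrease monotonically with K, so its maximum gives a direct candidate for the number of clusters rather than requiring a visual elbow judgment.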
# Initialize and fit KMeans on the train set
n_clusters = 5
clusterer = KMeans(n_clusters=n_clusters, random_state=42)
train_clusters = clusterer.fit_predict(X_train)
# Predict clusters for test set
test_clusters = clusterer.predict(X_test)
train = df.copy()
# Add the predicted clusters to the test DataFrame
train.loc[:, 'predicted_cluster'] = train_clusters
test.loc[:, 'predicted_cluster'] = test_clusters
# Calculate conversion rates
conversion_rates = train.groupby('predicted_cluster')['converted'].mean()
# Sort clusters by increasing conversion rates
conversion_rates_sorted = conversion_rates.sort_values()
# Plot the conversion rates
ax = conversion_rates_sorted.plot(kind='bar')
plt.title('Conversion Rate by Predicted Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Conversion Rate')
# Add number of points for each bar
for i, v in enumerate(conversion_rates_sorted):
    ax.text(i, v + 0.01, f'{train["predicted_cluster"].value_counts()[conversion_rates_sorted.index[i]]}', ha='center')
plt.show()
# Calculate conversion rate in each cluster for train set
train_conversion_rates = train.groupby('predicted_cluster')['converted'].mean()
# Sort clusters by increasing conversion rates for train set
train_conversion_rates_sorted = train_conversion_rates.sort_values()
# Calculate conversion rate in each cluster for test set
test_conversion_rates = test.groupby('predicted_cluster')['converted'].mean()
# Sort clusters by increasing conversion rates for test set
test_conversion_rates_sorted = test_conversion_rates.sort_values()
# Set the width of the bars
bar_width = 0.35
# Set the positions of the bars on the x-axis
r1 = np.arange(len(train_conversion_rates_sorted))
r2 = [x + bar_width for x in r1]
# Plot the conversion rates for both train and test sets
plt.bar(r1, train_conversion_rates_sorted, color='b', width=bar_width, label='Train')
plt.bar(r2, test_conversion_rates_sorted, color='r', width=bar_width, label='Test')
plt.title('Conversion Rate by Predicted Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Conversion Rate')
plt.xticks([r + bar_width/2 for r in range(len(train_conversion_rates_sorted))], train_conversion_rates_sorted.index, rotation=45)
plt.legend()
plt.show()
def evaluate_clustering(model, test_data):
    # Predict clusters on the test data
    predicted_labels = model.predict(test_data)
    # Silhouette Score
    silhouette = silhouette_score(test_data, predicted_labels)
    print("Silhouette Score:", silhouette)
evaluate_clustering(clusterer, X_test)
Silhouette Score: 0.39507667082064724
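Since the converted flag is known, the external metrics imported at the top (adjusted Rand index, homogeneity, and related scores) could also score the clusters against it. A sketch with toy labels:

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Toy example: cluster ids differ from the true labels, but the partition matches
true_labels = [0, 0, 1, 1, 1, 0]
cluster_ids = [2, 2, 0, 0, 0, 2]
print(adjusted_rand_score(true_labels, cluster_ids))  # → 1.0 (identical partitions)
print(homogeneity_score(true_labels, cluster_ids))    # → 1.0
```

Both metrics are invariant to cluster relabeling, so they measure how well the partition itself aligns with conversion, complementing the purely geometric silhouette score.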
Conclusion¶
Key Steps¶
Feature Analysis¶
Based on the variable distributions, I examined the features from both technical and business points of view.
Top Features Selection¶
Using several algorithms and methodologies, a shortlist of top features for effective customer segmentation was created. These features are essential for understanding customer preferences and behavior patterns.
Data Clustering¶
K-means clustering was used to uncover hidden patterns and structures within the dataset. This was the final step in developing the customer segmentation.
Conclusion¶
In conclusion, the customer segmentation model integrates feature analysis, top-feature selection, and clustering to provide valuable insights into customer behavior. Leveraging the customer_segment feature during training further improves the understanding of segmentation dynamics.